Midterm Review

Dr. Lucy D’Agostino McGowan

Overview

Topics Covered

  • Matrix fundamentals for regression
  • Least squares estimation
  • QR decomposition
  • Gauss-Markov theorem
  • Sums of squares decomposition
  • Hypothesis testing

Matrix Operations Review

Basic Matrix Notation

The Linear Model: \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]

Where:

  • \(\mathbf{y}\) is an \(n \times 1\) vector of responses
  • \(\mathbf{X}\) is an \(n \times p\) design matrix
  • \(\boldsymbol{\beta}\) is a \(p \times 1\) vector of parameters
  • \(\boldsymbol{\varepsilon}\) is an \(n \times 1\) vector of errors

The Design Matrix

\[\mathbf{X} = \begin{bmatrix}1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\\vdots & \vdots & \vdots & \ddots & \vdots \\1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1}\end{bmatrix}\]

Hat Notation

Parameters vs. Estimates:

  • \(\beta_0, \beta_1\) = true (unknown) parameters
  • \(\hat{\beta}_0, \hat{\beta}_1\) = estimated parameters from data

Observed vs. Predicted:

  • \(y_i\) = observed response values
  • \(\hat{y}_i\) = predicted values from model

Residuals

Definition: Difference between observed and predicted values

\[\hat\varepsilon_i = y_i - \hat{y}_i\]

In matrix form:

\[\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}}\]

Least Squares Estimation

What Are We Minimizing?

Sum of squared errors:

\[\text{SSE} = (\mathbf{y}- \mathbf{X}\boldsymbol\beta)^T(\mathbf{y}- \mathbf{X}\boldsymbol\beta)\]

Deriving the OLS Solution

Take derivative with respect to \(\boldsymbol{\beta}\) and set to zero:

\[\frac{\partial \text{SSE}}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{0}\]

Rearrange: \[\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}\]

These are the normal equations

The OLS Estimator

Solve for \(\hat{\boldsymbol{\beta}}\):

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

This is the ordinary least squares estimator
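
As a quick numerical sketch (Python with NumPy; the five-point dataset is invented purely for illustration), we can compute \(\hat{\boldsymbol{\beta}}\) by solving the normal equations directly:

```python
import numpy as np

# Toy data: n = 5 observations, one predictor plus an intercept (p = 2).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X'X beta = X'y (preferable to forming the
# inverse explicitly).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # [intercept, slope]
```

Solving the linear system rather than calling `np.linalg.inv` is the standard numerical practice here.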

Geometric Interpretation

  • Regression finds the projection of \(\mathbf{y}\) onto the column space of \(\mathbf{X}\)
  • \(\hat{\mathbf{y}}\) is the closest point in the column space to \(\mathbf{y}\)
  • Residuals are perpendicular to the column space: \(\mathbf{X}^T\hat{\boldsymbol{\varepsilon}} = \mathbf{0}\)
  • This orthogonality guarantees minimum distance

The Hat Matrix

Definition: \[\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\]

What it does: \[\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\]

Hat Matrix Properties

Symmetric: \(\mathbf{H}^T = \mathbf{H}\)

Idempotent: \(\mathbf{H}^2 = \mathbf{H}\)

\(\mathbf{I} - \mathbf{H}\) is also symmetric and idempotent

\[\hat{\boldsymbol{\varepsilon}} = (\mathbf{I} - \mathbf{H})\mathbf{y}\]
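
A small NumPy check of these properties (the design matrix is an arbitrary toy example):

```python
import numpy as np

# Toy design matrix: intercept column plus one predictor.
X = np.column_stack([np.ones(5), np.arange(1.0, 6.0)])

# Hat matrix H = X (X'X)^{-1} X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Symmetric and idempotent, as claimed above.
print(np.allclose(H, H.T), np.allclose(H @ H, H))

# I - H is also symmetric and idempotent.
M = np.eye(5) - H
print(np.allclose(M, M.T), np.allclose(M @ M, M))
```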

QR Decomposition

Why Not Just Use \((\mathbf{X}^T\mathbf{X})^{-1}\)?

Problems with the traditional approach:

  • Computing \(\mathbf{X}^T\mathbf{X}\) can be numerically unstable
  • Condition number gets squared: \(\kappa(\mathbf{X}^T\mathbf{X}) = \kappa(\mathbf{X})^2\)

QR decomposition provides a more stable solution

What is QR Decomposition?

Any \(n \times p\) matrix \(\mathbf{X}\) (with \(n \geq p\)) can be decomposed as:
\[\mathbf{X} = \mathbf{Q}\mathbf{R}\]

Where:

  • \(\mathbf{Q}\) is \(n \times p\) with orthonormal columns: \(\mathbf{Q}^T\mathbf{Q} = \mathbf{I}\)
  • \(\mathbf{R}\) is upper triangular

QR Solution to Least Squares

Traditional: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)

With QR: If \(\mathbf{X} = \mathbf{Q}\mathbf{R}\), then:

\[\begin{align} \hat{\boldsymbol{\beta}} &= (\mathbf{R}^T\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T\mathbf{y} \\ &= \mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y} \end{align}\]

Solve using back substitution: \(\mathbf{R}\hat{\boldsymbol{\beta}} = \mathbf{Q}^T\mathbf{y}\)
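
The same toy dataset from before, fit via QR (NumPy's `qr` returns the reduced/thin factorization by default):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])

# Thin QR: Q is n x p with orthonormal columns, R is p x p upper triangular.
Q, R = np.linalg.qr(X)

# Solve R beta = Q'y. (np.linalg.solve works here; a dedicated triangular
# solver would additionally exploit the structure of R.)
beta_hat = np.linalg.solve(R, Q.T @ y)
print(beta_hat)
```

The estimates agree with the normal-equations solution, but \(\mathbf{X}^T\mathbf{X}\) is never formed.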

Back Substitution

Upper triangular system is easy to solve:

\[\begin{bmatrix} r_{11} & r_{12} & r_{13} \\ 0 & r_{22} & r_{23} \\ 0 & 0 & r_{33} \end{bmatrix} \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \hat\beta_3 \end{bmatrix} = \begin{bmatrix} q_1 \\ q_2 \\ q_3 \end{bmatrix}\]

Back Substitution

Work backwards:

  1. Solve for \(\hat\beta_3\) from bottom row
  2. Substitute into second row, solve for \(\hat\beta_2\)
  3. Substitute into first row, solve for \(\hat\beta_1\)
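
The three steps above can be sketched as a short loop (a minimal implementation, not optimized):

```python
import numpy as np

def back_substitute(R, b):
    """Solve R x = b for upper-triangular R, working from the bottom row up."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # Subtract the already-solved components, then divide by the diagonal.
        x[i] = (b[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

# Small example: solve a 2 x 2 upper-triangular system.
R = np.array([[2.0, 1.0],
              [0.0, 3.0]])
b = np.array([5.0, 6.0])
print(back_substitute(R, b))  # [1.5, 2.0]
```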

QR Advantages

  • Only need back substitution, not full matrix inversion
  • More numerically stable
  • Avoids squaring the condition number
  • Hat matrix: \(\mathbf{H} = \mathbf{Q}\mathbf{Q}^T\)

Random Vectors & Gauss-Markov

Properties of Random Vectors

Expected value (linearity): \[E[\mathbf{A}\mathbf{Y} + \mathbf{b}] = \mathbf{A}E[\mathbf{Y}] + \mathbf{b}\]

Variance-covariance transformation: \[\text{Var}(\mathbf{A}\mathbf{Y} + \mathbf{b}) = \mathbf{A}\text{Var}(\mathbf{Y})\mathbf{A}^T\]
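
A Monte Carlo sanity check of the variance rule (the covariance matrix, \(\mathbf{A}\), and \(\mathbf{b}\) below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# A known covariance for Y, and a fixed linear transformation A, b.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([4.0, -1.0])

# Simulate Y ~ N(0, Sigma) and apply the transformation Z = AY + b.
Y = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
Z = Y @ A.T + b

# The empirical covariance of AY + b should approximate A Sigma A'.
print(np.cov(Z.T))
print(A @ Sigma @ A.T)
```

The constant shift \(\mathbf{b}\) moves the mean but, as the formula says, leaves the covariance untouched.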

Gauss-Markov Assumptions

  1. Linearity: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\)

  2. Zero mean errors: \(E[\boldsymbol{\varepsilon}] = \mathbf{0}\)

  3. Homoscedasticity & independence: \(\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}\)

  4. Full rank: \(\mathbf{X}\) has full column rank

The Gauss-Markov Theorem

Theorem: Under the GM assumptions, OLS is BLUE

BLUE = Best Linear Unbiased Estimator

  • Linear: \(\hat{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) for some matrix \(\mathbf{C}\)
  • Unbiased: \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\)
  • Best: Smallest variance among all linear unbiased estimators

OLS is Unbiased

Show: \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\)
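
One way to fill in the steps, using the linearity and zero-mean-error assumptions above:

\[\begin{align} E[\hat{\boldsymbol{\beta}}] &= E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}] \\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}] \\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\boldsymbol{\varepsilon}] \\ &= \boldsymbol{\beta} \end{align}\]

since \(E[\boldsymbol{\varepsilon}] = \mathbf{0}\).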

Why OLS Has Minimum Variance

Be able to demonstrate why!

Sums of Squares

The Three Key Quantities

Total Sum of Squares (TSS): \[\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2 = \mathbf{y}^T\mathbf{y} - n\bar{y}^2\]

The Three Key Quantities

Regression Sum of Squares (SS\(_{\text{Reg}}\)): \[\text{SS}_{\text{Reg}} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \mathbf{y}^T\mathbf{H}\mathbf{y} - n\bar{y}^2\]

The Three Key Quantities

Residual Sum of Squares (RSS): \[\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \mathbf{y}^T(\mathbf{I} - \mathbf{H})\mathbf{y}\]

Interpreting the Quantities

TSS: Total variation in the data
“How spread out are my y-values?”

SS\(_{\text{Reg}}\): Variation explained by the model
“How much variation did our model capture?”

RSS: Variation left unexplained
“How much did we miss with our model?”

The Fundamental Identity

Total variation = Explained + Unexplained

\[\text{TSS} = \text{SS}_{\text{Reg}} + \text{RSS}\]
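
The identity is easy to verify numerically using the matrix forms above (same invented toy dataset):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])
n = len(y)

H = X @ np.linalg.inv(X.T @ X) @ X.T
ybar = y.mean()

TSS = y @ y - n * ybar**2                # total variation
SSReg = y @ H @ y - n * ybar**2          # explained variation
RSS = y @ (np.eye(n) - H) @ y            # unexplained variation

print(TSS, SSReg + RSS)  # the two quantities match
```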

Coefficient of Determination (R²)

Definition: Proportion of variation explained by the model

\[R^2 = \frac{\text{SS}_{\text{Reg}}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}\]

Hypothesis Testing

Estimating \(\sigma^2\)

True error variance: \[\text{Var}(\varepsilon_i) = \sigma^2\]

Unbiased estimator: \[\hat{\sigma}^2 = \frac{\text{RSS}}{n-p}\]

where \(n-p\) is the degrees of freedom (number of observations minus number of parameters)

Also called Mean Square Error (MSE)

Testing Single Coefficients

Null hypothesis: \(H_0: \beta_j = 0\)

Test statistic: \[t = \frac{\hat{\beta}_j}{\text{se}(\hat{\beta}_j)} \sim t_{n-p}\]

Testing Single Coefficients

\[\text{se}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}\]

P-value: \(P(|T| \geq |t|)\) where \(T \sim t_{n-p}\)
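
Putting the pieces together for the toy dataset (computing the t statistics only; obtaining the p-value would additionally require the \(t_{n-p}\) CDF, e.g. from SciPy):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

sigma2_hat = resid @ resid / (n - p)           # MSE, the unbiased sigma^2 estimate
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))    # se(beta_j) for each coefficient
t = beta_hat / se                              # test statistics for H0: beta_j = 0
print(t)
```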

Where Does SE Come From?

Recall: \(\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\)

For individual coefficient \(j\): \[\text{Var}(\hat{\beta}_j) = \sigma^2[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}\]

Estimate by replacing \(\sigma^2\) with \(\hat{\sigma}^2\): \[\widehat{\text{Var}}(\hat{\beta}_j) = \hat{\sigma}^2[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}\]

Standard error is the square root: \[\text{se}(\hat{\beta}_j) = \sqrt{\widehat{\text{Var}}(\hat{\beta}_j)}\]

Overall F-Test

Tests whether the model is useful at all

Null: \(H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0\)
(all slopes equal zero, only intercept remains)

Test statistic: \[F = \frac{\text{SS}_{\text{Reg}}/(p-1)}{\text{RSS}/(n-p)} \sim F_{p-1, n-p}\]
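
The overall F statistic for the same toy dataset, built directly from the sums of squares:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
ybar = y.mean()
SSReg = y @ H @ y - n * ybar**2
RSS = y @ (np.eye(n) - H) @ y

# F = MS_Reg / MSE, compared against F_{p-1, n-p} under H0.
F = (SSReg / (p - 1)) / (RSS / (n - p))
print(F)
```

With a single slope, this F equals the squared t statistic for that slope.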

F-Test Interpretation

\[F = \frac{\text{Mean Square Regression}}{\text{Mean Square Error}}\]

Large F: Model explains a lot relative to noise
Small F: Model doesn’t explain much more than noise

Connection to R²

Alternative F-statistic form: \[F = \frac{R^2/(p-1)}{(1-R^2)/(n-p)}\]

This shows the F-test is testing whether R² is significantly different from 0

General Linear Hypotheses

Framework: \(H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{d}\)

Where:

  • \(\mathbf{C}\) is a contrast matrix (\(q \times p\))
  • \(\mathbf{d}\) is a vector of constants
  • \(q\) is the number of restrictions

General Linear Hypotheses

Test statistic: \[F = \frac{(\mathbf{C}\hat{\boldsymbol{\beta}} - \mathbf{d})^T[\mathbf{C}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{C}^T]^{-1}(\mathbf{C}\hat{\boldsymbol{\beta}} - \mathbf{d})/q}{\text{RSS}/(n-p)}\]

Under \(H_0\): \(F \sim F_{q, n-p}\)
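
As a sketch, here is the general formula specialized to the single restriction \(H_0: \beta_1 = 0\) on the toy dataset (so \(q = 1\), and the result reduces to the squared t statistic for the slope):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
RSS = resid @ resid

# H0: beta_1 = 0 written as C beta = d with q = 1 restriction.
C = np.array([[0.0, 1.0]])
d = np.array([0.0])
q = C.shape[0]

diff = C @ beta_hat - d
F = (diff @ np.linalg.inv(C @ XtX_inv @ C.T) @ diff / q) / (RSS / (n - p))
print(F)  # equals the squared t statistic for beta_1 here
```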

Examples of Linear Hypotheses

Single coefficient: \(H_0: \beta_1 = 0\) \[\mathbf{C} = [0, 1, 0, 0], \quad \mathbf{d} = 0\]

Equality of coefficients: \(H_0: \beta_1 = \beta_2\) \[\mathbf{C} = [0, 1, -1, 0], \quad \mathbf{d} = 0\]

Examples of Linear Hypotheses

Multiple restrictions: \(H_0: \beta_2 = \beta_3 = 0\) \[\mathbf{C} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad \mathbf{d} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\]

The ANOVA Table

  Source       df         Sum of Squares        Mean Square           F
  ------------ ---------- --------------------- --------------------- -------------------------
  Regression   \(p-1\)    SS\(_{\text{Reg}}\)   MS\(_{\text{Reg}}\)   MS\(_{\text{Reg}}\)/MSE
  Error        \(n-p\)    RSS                   MSE
  Total        \(n-1\)    TSS

Mean Squares

  • MS\(_{\text{Reg}}\) = SS\(_{\text{Reg}}\)/(p-1)
  • MSE = RSS/(n-p) = \(\hat{\sigma}^2\)

Understanding P-values

Definition: Probability of observing a test statistic as extreme or more extreme than what we observed, assuming \(H_0\) is true

For t-tests (two-sided): \[\text{p-value} = P(|T| \geq |t|) = 2 \times P(T \geq |t|)\]

Understanding P-values

Definition: Probability of observing a test statistic as extreme or more extreme than what we observed, assuming \(H_0\) is true

For F-tests (one-sided): \[\text{p-value} = P(F_{q,n-p} \geq f)\]